to go along with
Modern Data Science with R, 3rd edition by Baumer, Kaplan, and Horton
Introduction to Statistical Learning with Applications in R by James, Witten, Hastie, and Tibshirani
geom_point()aes() functionwday.ggplot() function.aes() function.aes() function| year | Algeria | Brazil | Columbia |
|---|---|---|---|
| 2000 | 7 | 12 | 16 |
| 2001 | 9 | 14 | 18 |
| country | Y2000 | Y2001 |
|---|---|---|
| Algeria | 7 | 9 |
| Brazil | 12 | 14 |
| Columbia | 16 | 18 |
| country | year | value |
|---|---|---|
| Algeria | 2000 | 7 |
| Algeria | 2001 | 9 |
| Brazil | 2000 | 12 |
| Brazil | 2001 | 14 |
| Columbia | 2000 | 16 |
| Columbia | 2001 | 18 |
#(a)
babynames |>
group_by(year, sex) |>
summarize(totalBirths = sum(num))
#(b)
group_by(babynames, year, sex) |>
summarize(totalBirths = sum(num))
#(c)
group_by(babynames, year, sex) |>
summarize(totalBirths = mean(num))
#(d)
temp <- group_by(babynames, year, sex)
summarize(temp, totalBirths = sum(num))
#(e)
summarize(group_by(babynames, year, sex),
totalBirths = sum(num))filter()arrange()select()mutate()group_by()(year, sex)(year, name)(year, num)(sex, name)(sex, num)n_distinct(name)n_distinct(n)sum(name)sum(num)mean(num)babynames <- babynames::babynames |>
rename(num = n)
babynames |>
filter(name %in% c("Jane", "Mary")) |>
# just the Janes and Marys
group_by(name, year) |>
# for each year for each name
summarize(total = sum(num))# A tibble: 276 × 3
# Groups: name [2]
name year total
<chr> <dbl> <int>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
babynames |>
filter(name %in% c("Jane", "Mary")) |>
group_by(name, year) |>
summarize(number = sum(num))# A tibble: 276 × 3
# Groups: name [2]
name year number
<chr> <dbl> <int>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
babynames |>
filter(name %in% c("Jane", "Mary")) |>
group_by(name, year) |>
summarize(n_distinct(name))# A tibble: 276 × 3
# Groups: name [2]
name year `n_distinct(name)`
<chr> <dbl> <int>
1 Jane 1880 1
2 Jane 1881 1
3 Jane 1882 1
4 Jane 1883 1
5 Jane 1884 1
6 Jane 1885 1
7 Jane 1886 1
8 Jane 1887 1
9 Jane 1888 1
10 Jane 1889 1
# ℹ 266 more rows
babynames |>
filter(name %in% c("Jane", "Mary")) |>
group_by(name, year) |>
summarize(n_distinct(num))# A tibble: 276 × 3
# Groups: name [2]
name year `n_distinct(num)`
<chr> <dbl> <int>
1 Jane 1880 1
2 Jane 1881 1
3 Jane 1882 1
4 Jane 1883 1
5 Jane 1884 1
6 Jane 1885 1
7 Jane 1886 1
8 Jane 1887 1
9 Jane 1888 1
10 Jane 1889 1
# ℹ 266 more rows
Error in `summarize()`:
ℹ In argument: `sum(name)`.
ℹ In group 1: `name = "Jane"` and `year = 1880`.
Caused by error in `base::sum()`:
! invalid 'type' (character) of argument
# A tibble: 276 × 3
# Groups: name [2]
name year `mean(num)`
<chr> <dbl> <dbl>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
# A tibble: 276 × 3
# Groups: name [2]
name year `median(num)`
<chr> <dbl> <dbl>
1 Jane 1880 215
2 Jane 1881 216
3 Jane 1882 254
4 Jane 1883 247
5 Jane 1884 295
6 Jane 1885 330
7 Jane 1886 306
8 Jane 1887 288
9 Jane 1888 446
10 Jane 1889 374
# ℹ 266 more rows
gdpyeargdpvalcountry–countrygdpyeargdpvalcountry–countrygdpyeargdpvalcountry–countryggplot() code. Which data frame should you use?35
pivot_wider() on raw datapivot_longer() on raw data# A tibble: 18 × 11
Subject day_0 day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 308 250. 259. 251. 321. 357. 415. 382. 290. 431. 466.
2 309 223. 205. 203. 205. 208. 216. 214. 218. 224. 237.
3 310 199. 194. 234. 233. 229. 220. 235. 256. 261. 248.
4 330 322. 300. 284. 285. 286. 298. 280. 318. 305. 354.
5 331 288. 285 302. 320. 316. 293. 290. 335. 294. 372.
6 332 235. 243. 273. 310. 317. 310 454. 347. 330. 254.
7 333 284. 290. 277. 300. 297. 338. 332. 349. 333. 362.
8 334 265. 276. 243. 255. 279. 284. 306. 332. 336. 377.
9 335 242. 274. 254. 271. 251. 255. 245. 235. 236. 237.
10 337 312. 314. 292. 346. 366. 392. 404. 417. 456. 459.
11 349 236. 230. 239. 255. 251. 270. 282. 308. 336. 352.
12 350 256. 243. 256. 256. 269. 330. 379. 363. 394. 389.
13 351 251. 300. 270. 281. 272. 305. 288. 267. 322. 348.
14 352 222. 298. 327. 347. 349. 353. 354. 360. 376. 389.
15 369 272. 268. 257. 278. 315. 317. 298. 348. 340. 367.
16 370 225. 235. 239. 240. 268. 344. 281. 348. 365. 372.
17 371 270. 272. 278. 282. 279. 285. 259. 305. 351. 369.
18 372 269. 273. 298. 311. 287. 330. 334. 343. 369. 364.
sleep_long <- sleep_wide |>
pivot_longer(cols = -Subject,
names_to = "day",
names_prefix = "day_",
values_to = "reaction_time")
sleep_long# A tibble: 180 × 3
Subject day reaction_time
<dbl> <chr> <dbl>
1 308 0 250.
2 308 1 259.
3 308 2 251.
4 308 3 321.
5 308 4 357.
6 308 5 415.
7 308 6 382.
8 308 7 290.
9 308 8 431.
10 308 9 466.
# ℹ 170 more rows
right_join()?36right_join()?37namebandplaysplays variable in a full_join()?38NANULLaddTen() function. The following output is a result of which map_*() call?39map(c(1,4,7), addTen)map_dbl(c(1,4,7), addTen)map_chr(c(1,4,7), addTen)map_lgl(c(1,4,7), addTen)[1] "11.000000" "14.000000" "17.000000"
map(c(1, 4, 7), addTen)map(list(1, 4, 7), addTen)map(data.frame(a=1, b=4, c=7), addTen)map(c(1, 4, 7), addTen)map(c(1, 4, 7), ~addTen(.x))map(c(1, 4, 7), ~addTen)map(c(1, 4, 7), function(hi) (hi + 10))map(c(1, 4, 7), ~(.x + 10))ifelse() function takes the arguments:43set.seed() function45wherever you are, make sure you are communicating with me when you have questions!
wherever you are, make sure you are communicating with me when you have questions!
no right answer here!
Yes! All the responses are reasons to make a figure.
aes() functionwday.aes() functionanswers may vary. I’d say c. putting the work in context. Others might say b. facilitating comparison or d. simplifying the story. However, I don’t think a correct answer is a. making the data stand out.
mean() (average) instead of the sum(). The other commands compute the total number of births broken down by year and sex.filter()(year, name)sum(num)running the different code chunks with relevant output.
-countryyeargdpval (if possible, good idea to name variables something different from the name of the data frame)pivot_longer() on raw data. The reference to the study is: Gregory Belenky, Nancy J. Wesensten, David R. Thorne, Maria L. Thomas, Helen C. Sing, Daniel P. Redmond, Michael B. Russo and Thomas J. Balkin (2003) Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: a sleep dose-response study. Journal of Sleep Research 12, 1–12.NA (it would be NULL in SQL)map_chr(c(1,4,7), addTen) because the output is in quotes, the values are strings, not numbers.map() function allows vectors, lists, and data frames as input.map(c(1, 4, 7), ~addTen). The ~ acts on functions that do not have their own name or that are defined by function(...). By adding the argument (.x) we’ve expanded the addTen() function, and so it needs a ~. The addTen() function all alone does not use a ~.